Abstract

We consider a general high-dimensional additive hazards model in a non-asymptotic
setting, including regression for censored data. In this context, we consider a Lasso
estimator with a fully data-driven $\ell_1$ penalization, which is tuned for the estimation
problem at hand. We prove sharp oracle inequalities for this estimator. Our analysis
involves a new "data-driven" Bernstein's inequality, which is of independent interest,
where the predictable variation is replaced by the optional variation.

Keywords. Survival analysis; Counting processes; Censored data; Aalen additive 
model; Lasso; High-dimensional covariates; data-driven Bernstein's inequality 

1 Introduction 

Interest has grown recently in connecting gene expression profiles to patients' survival
times, see e.g. [30], [M], where the aim is to assess the influence of gene expressions on
the survival outcomes. The statistical analysis of such data faces two sorts of problems.
First, the covariates are high-dimensional: the number of covariates is much larger than
the number of observations. Second, the survival outcomes suffer from censoring,
truncation, etc. The need for proper statistical methods to analyze such data, in particular
high-dimensional right-censored data, has led in the past years to numerous theoretical and
computational contributions.

When the survival times suffer from right-censoring, the problem can be presented
as follows. For an individual $i \in \{1, \dots, n\}$, let $T_i$ be the time of interest (e.g. the
patient survival time), let $C_i$ be the censoring time and $X_i$ be the vector of covariates
in $\mathbb{R}^d$, assumed to be independent copies of T, C and $X = (X^1, \dots, X^d)$. We observe
$Z_i = T_i \wedge C_i$, $\delta_i = 1(T_i \le C_i)$ and $X_i$ for $i = 1, \dots, n$.

The covariates vector X, where both genomic outcomes and clinical data may be
recorded, is in high dimension $d \gg n$ and influences the distribution of T via its conditional
hazard rate given X = x, defined by

$$\alpha_0(t, x) = \frac{f_{T|X}(t, x)}{1 - F_{T|X}(t, x)}$$

for t > 0, where $f_{T|X}$ and $F_{T|X}$ are respectively the conditional density and distribution
functions of T given X = x. In the following, we assume that the conditional hazard
fulfills the Aalen additive hazards model [I]:

$$\alpha_0(t, x) = \lambda_0(t) + x^\top \beta_0, \quad \forall t \ge 0,$$

where $\lambda_0$ is the baseline hazard function and $\beta_0$ measures the influence of the covariates
on the conditional hazard function $\alpha_0$. In [21], an additive hazards model is fitted to
investigate the influence of the expression levels of 8810 genes on the (censored) survival
times of 92 patients suffering from Mantle-Cell Lymphoma, see [30] for the data. The
Aalen additive hazards model is indeed a useful alternative to the Cox model [10], in
particular in situations where the proportional hazards assumption is violated. It can also
"be seen as a first-order Taylor series expansion of a general intensity" (see [23], p. 103).

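When the aim is to understand the influence of X on the survival time T, one
wants to estimate $\beta_0$ based on the observations. In small dimension $d \ll n$ and from the
data $(Z_i, \delta_i, X_i)_{i=1,\dots,n}$, the least-squares estimator $\hat\beta$ of the unknown $\beta_0$ is the minimizer
of the quadratic functional

$$R_n(\beta) = \beta^\top H_n \beta - 2\beta^\top h_n,$$

where $H_n$ is the $d \times d$ symmetric positive semidefinite matrix with entries

$$(H_n)_{j,k} = \frac{1}{n}\sum_{i=1}^n \int_0^{Z_i} \Big(X_i^j - \frac{\sum_{l=1}^n X_l^j 1(Z_l \ge t)}{\sum_{l=1}^n 1(Z_l \ge t)}\Big)\Big(X_i^k - \frac{\sum_{l=1}^n X_l^k 1(Z_l \ge t)}{\sum_{l=1}^n 1(Z_l \ge t)}\Big)\,dt$$

and where $h_n \in \mathbb{R}^d$ has coordinates

$$(h_n)_j = \frac{1}{n}\sum_{i=1}^n \delta_i \Big(X_i^j - \frac{\sum_{l=1}^n X_l^j 1(Z_l \ge Z_i)}{\sum_{l=1}^n 1(Z_l \ge Z_i)}\Big).$$

When d < n and if $H_n$ is full rank, we can write

$$\hat\beta = (H_n)^{-1} h_n,$$

see also [19] or [25]. The estimator $\hat\beta$ is $\sqrt{n}$-consistent and asymptotically Gaussian, see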
e.g. 0- 

When X contains genomic outcomes, one typically has $d \gg n$, and the matrix $H_n$ is no
longer of full rank. A sparsity assumption is then natural in this setting: we expect only a
few genes to have an influence on the survival times, so we expect $\beta_0$ to be sparse, which
means that it has only a few non-zero coordinates. Several papers use sparsity-inducing
penalization in the context of survival analysis, mainly for the Cox multiplicative risks
model or the Aalen additive risks model; we refer to [35] for a review. Most procedures
are based on $\ell_1$-penalization, where one considers

$$\hat\beta \in \operatorname*{argmin}_{\beta \in \mathbb{R}^d} \Big\{ R_n(\beta) + \lambda \sum_{j=1}^d w_j |\beta_j| \Big\}. \qquad (1)$$

The smoothing parameter $\lambda > 0$ balances goodness-of-fit against sparsity,
and the $w_j > 0$, $j = 1, \dots, d$, are weights allowing for a precise tuning of the penalization.
The Lasso penalization corresponds to the simple choice $w_j = 1$, while in the adaptive
Lasso [38] one chooses $w_j = |\tilde\beta_j|^{-\gamma}$, where $\tilde\beta_j$ is a preliminary estimator and $\gamma > 0$ a
constant. The idea behind this is to correct the bias of the Lasso in terms of variable
selection accuracy, see [38] and [37] for regression analysis. The weights $w_j$ can also be
used to scale each variable at the same level, which is suitable when some variable has a
strong variance compared to the others. As a by-product of the theoretical analysis given
in this paper, we introduce a new way of scaling the variables using data-driven weights
$\hat w_j$ in the $\ell_1$ penalization, see (14) below.
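Because the criterion in (1) is a quadratic form plus a weighted $\ell_1$ term, generic Lasso solvers apply. The following is a minimal sketch, assuming a plain cyclic coordinate descent solver (chosen here only for illustration; it is not a method taken from the cited references). The names `soft_threshold` and `weighted_lasso_cd` are hypothetical, and `H`, `h`, `weights`, `lam` stand for $H_n$, $h_n$, $(w_j)$ and $\lambda$.

```python
from typing import List


def soft_threshold(z: float, t: float) -> float:
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso_cd(H: List[List[float]], h: List[float],
                      weights: List[float], lam: float,
                      n_sweeps: int = 200) -> List[float]:
    """Minimize b'Hb - 2 b'h + lam * sum_j weights[j] * |b_j|
    by cyclic coordinate descent (illustrative sketch only)."""
    d = len(h)
    beta = [0.0] * d
    for _ in range(n_sweeps):
        for j in range(d):
            if H[j][j] <= 0.0:
                # Degenerate coordinate: kept at zero in this sketch.
                beta[j] = 0.0
                continue
            # Partial residual: h_j - sum_{k != j} H_jk * beta_k
            r = h[j] - sum(H[j][k] * beta[k] for k in range(d) if k != j)
            # Exact minimizer of the one-dimensional problem in beta_j
            beta[j] = soft_threshold(r, lam * weights[j] / 2.0) / H[j][j]
    return beta
```

With `lam = 0` and an invertible `H`, the iterates converge to the unpenalized solution $H_n^{-1} h_n$; a larger `weights[j]` shrinks coordinate j more strongly.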

In the Cox proportional hazards model, $R_n(\beta)$ is the partial likelihood (see e.g. [10] or
[2]), for which the Lasso, adaptive Lasso, smooth clipped absolute deviation penalizations
and the Dantzig selector are considered, respectively, in [31], [39], [36], [12] and [3].

For the additive risks, [22] considers principal component regression, [21] considers
a Lasso with a least-squares criterion that differs from the one considered here, [18] and [25]
consider the ridge, Lasso and adaptive Lasso penalizations and [24] considers the partial
least-squares and ridge regression estimators.

A serious computational advantage of additive risks over multiplicative risks
has to be highlighted. Indeed, for the additive risks, the estimating
Equation (1) has a least-squares form, so that one can apply in this case the fast LARS
algorithm [11] in order to obtain the whole path of solutions of the Lasso, as explained
in [18] for instance. This point is particularly relevant in practice, since one typically
uses splitting techniques, such as cross-validation, to select the smoothing parameter,
or ensemble feature methods, such as stability selection [27], to select covariates. The
motivations and main contributions of this work are enumerated in the following.

First motivation. Among the papers that propose some mathematical analysis of
the statistical properties of estimators of the form (1) (upper bounds, support recovery,
etc.), the results are asymptotic in the number of observations. This can be a problem
since, in practice, one cannot, in general, consider that the asymptotic regime has been
reached: in [30], for example, the expression levels of 8810 genes and survival information
are measured for only 92 patients. Considering only the references that are the closest
to the work proposed here, the oracle property for the adaptive Lasso is given in [18],
which is an asymptotic property about the support and the asymptotic distribution of the
estimator, and asymptotic normality and consistency in variable selection for the adaptive
Lasso are proved in [25], where results about the Dantzig selector are also derived using
the restricted isometry property and the uniform uncertainty principle from [8]. While
non-asymptotic results, like sparse oracle inequalities for instance, are now well-known for
regression or density estimation (see for instance [7], [5], [I], among many others), such
results are not yet available for survival data. In this paper, we establish the first results
of this kind for survival analysis.

Second motivation. We give sharp oracle inequalities (with leading constant 1) for the 
prediction error associated to the survival problem. The results are stated for general 
counting processes, including the censoring case, while most papers consider censored 
data only. Our results are stated without the assumption that the intensity is linear in the 
covariates. In fact, our Lasso estimator can be computed using an arbitrary dictionary of 
functions, so that one can expect a better approximation of the true underlying intensity. 

Third motivation. In order to prove our results, we need a new version of Bernstein's
inequality for martingales with jumps, where the predictable variation, which is not
observable in this problem, is replaced by the optional variation, which is observable. This
concentration inequality is of independent interest, and could be useful for other statistical
problems as well.

Fourth motivation. Finally, and more importantly, our non-asymptotic analysis leads
to an adaptive data-driven weighting of the $\ell_1$-norm, that involves the optional variation
of each element of the dictionary (or of each covariate in the linear case). More precisely,
our sharp control of the noise term exhibits the fact that the $\ell_1$-penalization (see (1))
should be scaled using data-driven weights of order (writing only the dominating terms,
and using the notation of Section 3, see there for details)

$$\hat w_j \approx \sqrt{\frac{\log M}{n}\,\hat V(h_j)},$$

where

$$\hat V(h_j) = \frac{1}{n}\sum_{i=1}^n \int_0^1 \big(h_j(X_i) - \bar h_{j,Y}(t)\big)^2\,dN^i_t$$

corresponds, roughly, to an estimate of the variance of variable j. Hence, our theoretical
analysis exhibits a new way of tuning the $\ell_1$ penalization, by multiplying each coordinate
by this empirical variance term, in order to make less apparent possible differences between
the variability of each $X^j$ for $j = 1, \dots, d$. This particular form of weighting, or scaling of
the variables, was not previously noticed in the literature.

The paper is organized as follows. Section 2 describes the model. The Lasso estimator
is constructed in Section 3. Oracle inequalities for the Lasso are given in Section 4, see
Theorems 1 and 2. Some details about the construction of the least-squares criterion are
given in Section 6.1. The data-driven Bernstein's inequality is stated in Section 5, see
Theorem 3, and the proofs of our results are given in Section 6.

2 High dimensional Aalen model 

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $(\mathcal{F}_t)_{t \ge 0}$ a filtration satisfying the usual conditions:
increasing, right-continuous and complete (see [14]). Let N be a marked counting process
with compensator $\Lambda$ with respect to $(\mathcal{F}_t)_{t \ge 0}$, so that $M = N - \Lambda$ is a $(\mathcal{F}_t)_{t \ge 0}$-martingale.
We assume that N is a marked point process satisfying the Aalen multiplicative intensity
model. This means that $\Lambda$ writes

$$\Lambda(t) = \int_0^t \alpha_0(s, X) Y_s\,ds \qquad (2)$$

for all $t \ge 0$, where:

• the intensity $\alpha_0$ is an unknown deterministic and nonnegative function;

• $X \in \mathbb{R}^d$ is an $\mathcal{F}_0$-measurable random vector called covariates or marks;

• Y is a predictable random process in [0, 1].

With differential notations, this model can be written as

$$dN_t = \alpha_0(t, X) Y_t\,dt + dM_t \qquad (3)$$

for all $t \ge 0$ with the same notations as before, and taking $\Lambda_0 = 0$. Now, assume that we
observe n i.i.d. copies

$$D_n = \{(X_i, N^i_t, Y^i_t) : t \in [0, \tau],\ 1 \le i \le n\} \qquad (4)$$

of $\{(X, N_t, Y_t) : t \in [0, \tau]\}$, where $\tau$ is the end-point of the study. Without loss of generality,
we set $\tau = 1$. We can write

$$dN^i_t = \alpha_0(t, X_i) Y^i_t\,dt + dM^i_t$$

for any $i = 1, \dots, n$, where the $M^i$ are independent $(\mathcal{F}_t)_{t \ge 0}$-martingales. In this setting, the
random variable $N^i_t$ is the number of observed failures during the time interval [0, t] of
the individual i. This model encompasses several particular examples: censored data,
marked Poisson processes and Markov processes, see e.g. [2] for a precise exposition. In the
censored case, described in the Introduction, the random processes in $D_n$, see Equation (4),
are given by

$$N^i(t) = 1(Z_i \le t,\ \delta_i = 1) \quad \text{and} \quad Y^i(t) = 1(Z_i \ge t)$$

for $i = 1, \dots, n$ and $0 \le t \le 1$.
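For the censored case, the observed quantities and the two processes above can be written down directly. The following is a small sketch; the function names `observe`, `counting_process` and `at_risk` are hypothetical, and the convention that an event is counted as observed when $T_i \le C_i$ is an assumption of this sketch (ties have probability zero for continuous times).

```python
from typing import List, Tuple


def observe(T: List[float], C: List[float]) -> Tuple[List[float], List[int]]:
    """Return (Z, delta) with Z_i = min(T_i, C_i) and delta_i = 1(T_i <= C_i)."""
    Z = [min(t, c) for t, c in zip(T, C)]
    delta = [1 if t <= c else 0 for t, c in zip(T, C)]
    return Z, delta


def counting_process(z_i: float, delta_i: int, t: float) -> int:
    """N^i(t) = 1(Z_i <= t, delta_i = 1): has the failure been observed by time t?"""
    return 1 if (z_i <= t and delta_i == 1) else 0


def at_risk(z_i: float, t: float) -> int:
    """Y^i(t) = 1(Z_i >= t): is individual i still under observation at time t?"""
    return 1 if z_i >= t else 0
```

For instance, an individual with `T = 0.9` and `C = 0.4` is censored at `Z = 0.4`: its counting process stays at 0 and it leaves the risk set after time 0.4.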

In this paper, we assume that the intensity function satisfies the Aalen additive model
in the sense that it writes

$$\alpha_0(t, x) = \lambda_0(t) + h_0(x), \qquad (5)$$

where $\lambda_0 : \mathbb{R}_+ \to \mathbb{R}_+$ is a nonparametric baseline intensity and $h_0 : \mathbb{R}^d \to \mathbb{R}_+$. Note that
in the "usual" Aalen additive model, see e.g. [26], [25], the function $h_0$ is linear:

$$h_0(x) = x^\top \beta_0,$$

where $\beta_0$ is an unknown vector in $\mathbb{R}^d$. The aim of the paper is to recover the function $h_0$
based on the observation of the sample $D_n$.

3 Construction of an $\ell_1$-penalization procedure



3.1 A least-squares type functional 

The problem considered here is a regression problem: we want to explain the influence of
the covariates $X_i$ on the survival data $N^i$ and $Y^i$. Namely, we want to infer on $h_0$, while
the baseline function $\lambda_0$ is considered as a nuisance parameter. Thanks to the additive
structure (5), we can construct an estimator of $h_0$ without any estimation of $\lambda_0$, so that
the influence of the covariates on the survival data can be inferred without any knowledge
of $\lambda_0$. This classical principle leads to the construction of the partial likelihood in the
Cox model (multiplicative risks, see [10]) and to the construction of the "partial"
least-squares (in reference to the partial likelihood) for the additive risks, see [19], which is
the one considered here. The "partial least-squares" criterion for a "covariate" function
$h : \mathbb{R}^d \to \mathbb{R}_+$ is defined as:

$$h \mapsto \frac{1}{n}\sum_{i=1}^n \int_0^1 \big(h(X_i) - \bar h_Y(t)\big)^2 Y^i_t\,dt - \frac{2}{n}\sum_{i=1}^n \int_0^1 \big(h(X_i) - \bar h_Y(t)\big)\,dN^i_t, \qquad (6)$$

where

$$\bar h_Y(t) = \frac{\sum_{i=1}^n h(X_i) Y^i_t}{\sum_{i=1}^n Y^i_t}.$$

It was first introduced in [19]. The main steps leading to (6) are described in
Section 6.1 below, where we explain why it is indeed suitable for the estimation of $h_0$ (see in
particular Equation (20)).
Now, we consider a set

$$\mathcal{H} = \{h_1, \dots, h_M\}$$

of functions $h_j : \mathbb{R}^d \to \mathbb{R}_+$, called dictionary, where M is large ($M \gg n$). The set $\mathcal{H}$
can be a collection of basis functions that can approximate the unknown $h_0$, like wavelets,
splines, kernels, etc. They can also be estimators computed using an independent training
sample, like several estimators computed using different tuning parameters, leading to
the so-called aggregation problem, see [6] for instance. Implicitly, it is assumed that the
unknown $h_0$ is well-approximated by a linear combination

$$h_\beta(x) = \sum_{j=1}^M \beta_j h_j(x), \qquad (7)$$

where $\beta \in \mathbb{R}^M$ is to be estimated. However, note that we will not assume, for the statements
of our results, that the unknown $h_0$ is equal to $h_{\beta_0}$ for some unknown $\beta_0 \in \mathbb{R}^M$, hence
allowing for a model bias. Note that the setting considered here includes the linear case:
if $h_j(x) = x_j$ with d = M, then the estimator has the form $\hat h(x) = x^\top \hat\beta$. Introducing

$$\bar h_{j,Y}(t) = \frac{\sum_{i=1}^n h_j(X_i) Y^i_t}{\sum_{i=1}^n Y^i_t} \quad \text{and} \quad \bar h_{\beta,Y}(t) = \sum_{j=1}^M \beta_j \bar h_{j,Y}(t), \qquad (8)$$

we define the least-squares risk of $\beta \in \mathbb{R}^M$ as

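$$R_n(\beta) = \frac{1}{n}\sum_{i=1}^n \int_0^1 \big(h_\beta(X_i) - \bar h_{\beta,Y}(t)\big)^2 Y^i_t\,dt - \frac{2}{n}\sum_{i=1}^n \int_0^1 \big(h_\beta(X_i) - \bar h_{\beta,Y}(t)\big)\,dN^i_t, \qquad (9)$$

which is equal to the functional (6) where we applied (7). Note that (9) is a least-squares
criterion, since

$$R_n(\beta) = \beta^\top H_n \beta - 2\beta^\top h_n, \qquad (10)$$

where $H_n$ is the $M \times M$ matrix with entries

$$(H_n)_{j,k} = \frac{1}{n}\sum_{i=1}^n \int_0^1 \big(h_j(X_i) - \bar h_{j,Y}(t)\big)\big(h_k(X_i) - \bar h_{k,Y}(t)\big) Y^i_t\,dt, \qquad (11)$$

and where $h_n \in \mathbb{R}^M$ has coordinates

$$(h_n)_j = \frac{1}{n}\sum_{i=1}^n \int_0^1 \big(h_j(X_i) - \bar h_{j,Y}(t)\big)\,dN^i_t.$$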

Since $H_n$ is a symmetric positive semidefinite matrix, we can take

$$G_n = H_n^{1/2},$$

so that

$$R_n(\beta) = |G_n \beta|_2^2 - 2\beta^\top h_n,$$

where $|x|_2$ stands for the $\ell_2$-norm of $x \in \mathbb{R}^n$. Note that we will denote by $|x|_p$ the $\ell_p$-norm
of x.
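In the censored case, where $Y^i_t = 1(Z_i \ge t)$ and $N^i$ jumps at $Z_i$ only if $\delta_i = 1$, the integrands in $H_n$ and $h_n$ are piecewise constant between consecutive observed times, so both can be computed exactly. The following is a hedged sketch under that assumption, further assuming that all $Z_i$ are positive and do not exceed the end-point of the study. The function name `partial_least_squares_terms` and its arguments are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def partial_least_squares_terms(
    X: Sequence, Z: Sequence[float], delta: Sequence[int],
    dictionary: Sequence[Callable],
) -> Tuple[List[List[float]], List[float]]:
    """Compute (H_n, h_n) for right-censored data and a dictionary h_1..h_M."""
    n, M = len(Z), len(dictionary)
    feats = [[hj(x) for hj in dictionary] for x in X]  # feats[i][j] = h_j(X_i)
    H = [[0.0] * M for _ in range(M)]
    h = [0.0] * M
    prev = 0.0
    for z in sorted(set(Z)):
        # On (prev, z] the risk set {i : Z_i >= t} is constant.
        risk = [i for i in range(n) if Z[i] >= z]
        mean = [sum(feats[i][j] for i in risk) / len(risk) for j in range(M)]
        length = z - prev
        for i in risk:
            c = [feats[i][j] - mean[j] for j in range(M)]  # centered features
            for j in range(M):
                for k in range(M):
                    H[j][k] += length * c[j] * c[k] / n
            # dN^i has a unit jump at Z_i only when the failure is observed.
            if Z[i] == z and delta[i] == 1:
                for j in range(M):
                    h[j] += c[j] / n
        prev = z
    return H, h
```

As a usage example, with two individuals, covariates 0 and 2, observed times 1 and 2 (both uncensored) and the single dictionary element $h_1(x) = x$, a hand computation gives $H_n = 1$ and $h_n = -1/2$.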

3.2 $\ell_1$-penalization for the Aalen model

For the problem considered here, we have seen that the empirical risk $R_n$ has to be chosen
with care. This is also the case for the $\ell_1$ penalization to be used for this problem. Namely,
for a well-chosen sequence of positive data-driven weights $\hat w = (\hat w_1, \dots, \hat w_M)$, we consider
the weighted $\ell_1$-norm

$$\mathrm{pen}(b) = |b|_{1,\hat w} = \sum_{j=1}^M \hat w_j |b_j|, \qquad (12)$$

and choose $\hat\beta_n$ according to the following penalized criterion:

$$\hat\beta_n \in \operatorname*{argmin}_{b \in B} \big\{ R_n(b) + \mathrm{pen}(b) \big\}, \qquad (13)$$

where B is an arbitrary convex set (typically $B = \mathbb{R}^M$ or $B = \mathbb{R}^M_+$, the latter making
the coordinates of $\hat\beta_n$ non-negative). The weights considered in (13) are given by $\hat w_j = \hat w(h_j)$ (where we recall
that $h_j \in \mathcal{H}$) and where for any function h, we take

$$\hat w(h) = c_1 \sqrt{\frac{x + \log M + \hat\ell_{n,x}(h)}{n}\,\hat V(h)} + c_2\,\frac{x + 1 + \log M + \hat\ell_{n,x}(h)}{n}\,\|h\|_{n,\infty}, \qquad (14)$$

where:

• $x > 0$ and $c_1 = 2\sqrt{2}$, $c_2 = 4\sqrt{14/3} + 2/3$,

• $\|h\|_{n,\infty} = \max_{i=1,\dots,n} |h(X_i)|$,

• $\hat V(h)$ is a term corresponding to the "observable empirical variance" of h (see below
for details), given by

$$\hat V(h) = \frac{1}{n}\sum_{i=1}^n \int_0^1 \big(h(X_i) - \bar h_Y(t)\big)^2\,dN^i_t,$$

• $\hat\ell_{n,x}(h)$ is a small technical term coming out of our analysis:

$$\hat\ell_{n,x}(h) = 2 \log\log\Big( \frac{6 e n \hat V(h) + 56 e x \|h\|_{n,\infty}^2}{112\,\|h\|_{n,\infty}^2} \vee e \Big).$$

Note that the weights $\hat w_j$ are fully data-driven. The shape of these weights comes from a
new empirical Bernstein's inequality involving the optional variation of the noise process
of the model, see Theorem 3 in Section 5 below.
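To make the data-driven nature of the weights concrete, here is a hedged sketch for right-censored data. It assumes the weight has the two-term form "$c_1$ times a square-root variance term plus $c_2$ times a sup-norm term" with $c_1 = 2\sqrt{2}$ and $c_2 = 4\sqrt{14/3} + 2/3$, as read from (14). The log log correction is left as a user-supplied argument `ell` (default 0) rather than hard-coded. The function names `optional_variation` and `data_driven_weight` are hypothetical.

```python
import math
from typing import Sequence


def optional_variation(values: Sequence[float], Z: Sequence[float],
                       delta: Sequence[int]) -> float:
    """V_hat(h) for censored data, with values[i] = h(X_i).

    Since N^i jumps at Z_i only when delta_i = 1, the integral against dN^i
    reduces to (h(X_i) - hbar_Y(Z_i))^2 for uncensored individuals.
    """
    n = len(Z)
    total = 0.0
    for i in range(n):
        if delta[i] == 1:
            risk = [values[l] for l in range(n) if Z[l] >= Z[i]]
            hbar = sum(risk) / len(risk)  # risk-set average of h at time Z_i
            total += (values[i] - hbar) ** 2
    return total / n


def data_driven_weight(v_hat: float, sup_norm: float, x: float,
                       M: int, n: int, ell: float = 0.0) -> float:
    """Two-term weight: sub-Gaussian part driven by v_hat, plus a 1/n part."""
    c1 = 2.0 * math.sqrt(2.0)
    c2 = 4.0 * math.sqrt(14.0 / 3.0) + 2.0 / 3.0
    level = x + math.log(M) + ell
    return c1 * math.sqrt(level / n * v_hat) + c2 * (level + 1.0) / n * sup_norm
```

A dictionary element with a larger observable variance `v_hat` receives a larger weight, which is the scaling effect described in the text.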

The penalization (12) is tuned for the estimation problem at hand. It uses the estimator
$\hat V(h)$ of the (unobservable) predictable quadratic variation

$$V(h) = \frac{1}{n}\sum_{i=1}^n \int_0^1 \big(h(X_i) - \bar h_Y(t)\big)^2 \alpha_0(t, X_i) Y^i_t\,dt$$

and it does not depend on a uniform upper bound for h. As a consequence, it can give,
from a practical point of view, some insight into the tuning of the $\ell_1$-penalization. In
particular, our analysis proves that the j-th coordinate of $\beta$ in the $\ell_1$ penalization should
be rescaled by $\hat V(h_j)^{1/2}$. Note that this was not previously noticed in the literature, in part
because most results are stated using an asymptotic point of view, see the references
mentioned in the Introduction.



4 Oracle inequalities 

If $\beta \in \mathbb{R}^M$, we denote its support by $J(\beta) = \{j \in \{1, \dots, M\} : \beta_j \ne 0\}$ and its sparsity is
$|\beta|_0 = |J(\beta)| = \sum_{j=1}^M 1(\beta_j \ne 0)$, where $1(A)$ is the indicator of A and $|B|$ is the cardinality
of a finite set B. If $J \subset \{1, \dots, M\}$, we also introduce the vector $\beta_J$ such that $(\beta_J)_j = \beta_j$
if $j \in J$ and $(\beta_J)_j = 0$ if $j \in J^c$, where $J^c = \{1, \dots, M\} - J$. We define the empirical
norm of a function h by

$$\|h\|_n^2 = \frac{1}{n}\sum_{i=1}^n \int_0^1 \big(h(X_i) - \bar h_Y(t)\big)^2 Y^i_t\,dt, \qquad (15)$$

and remark that $\|h_\beta\|_n^2 = |G_n \beta|_2^2 / n$.

Below are two oracle inequalities for $h_{\hat\beta_n}$. The first one (Theorem 1) is a "slow" oracle
inequality, with a rate of order $(\log M / n)^{1/2}$, which holds without any assumption on the
Gram matrix $G_n$. The second one (Theorem 2) is an oracle inequality with a fast rate of
order $\log M / n$, that holds under an assumption on the restricted eigenvalues of $G_n$.

Theorem 1. Let $x > 0$ be fixed, and let $\hat h = h_{\hat\beta_n}$, where

$$\hat\beta_n \in \operatorname*{argmin}_{b \in B} \big\{ R_n(b) + \mathrm{pen}(b) \big\},$$

with pen(b) given by (12). Then we have, with a probability larger than $1 - 29e^{-x}$:

$$\|\hat h - h_0\|_n^2 \le \inf_{\beta \in B} \big( \|h_\beta - h_0\|_n^2 + 2\,\mathrm{pen}(\beta) \big).$$

Note that

$$\mathrm{pen}(\beta) \le |\beta|_1 \max_{j=1,\dots,M} \Big[ c_1 \sqrt{\frac{x + \log M + \hat\ell_{n,x}(h_j)}{n}\,\hat V(h_j)} + c_2\,\frac{x + 1 + \log M + \hat\ell_{n,x}(h_j)}{n}\,\|h_j\|_{n,\infty} \Big]$$

for any $\beta \in \mathbb{R}^M$, so the dominant term in pen($\beta$) is, up to the slow log log term, of order
$|\beta|_1 \sqrt{\log M / n}$, which is the expected slow rate for $\hat h$ involving the $\ell_1$-norm (see [5] for the
regression model and d] for density estimation).

For the proof of oracle inequalities with a fast $\log M / n$ rate, the restricted eigenvalue
(RE) condition introduced in [5] and [15], [16] is of importance. Restricted eigenvalue
conditions are implied by, and in general weaker than, the so-called incoherence or RIP
assumptions, which exclude strong correlations between covariates. This condition is
acknowledged to be one of the weakest to derive fast rates for the Lasso. One can find
in [33] an exhaustive survey and comparison of the assumptions used to prove fast oracle
inequalities for the Lasso, where the so-called "compatibility condition", which is slightly
more general than RE, is described.

The restricted eigenvalue condition is defined below. Note that our presentation (and
arguments used in the proof of Theorem 2) is close to [17], where oracle inequalities for
the matrix Lasso are given. Let us introduce, for any $\beta \in \mathbb{R}^M$ and $c_0 > 0$, the cone

$$C_{\beta, c_0} = \{ b \in \mathbb{R}^M : |b_{J(\beta)^c}|_{1,\hat w} \le c_0 |b_{J(\beta)}|_{1,\hat w} \}. \qquad (16)$$

The cone $C_{\beta, c_0}$ consists of vectors that have a support close to the support of $\beta$. Then,
introduce

$$\mu_{c_0}(\beta) = \inf\Big\{ \mu > 0 : |b_{J(\beta)}|_2 \le \frac{\mu}{\sqrt{n}} |G_n b|_2 \ \ \forall b \in C_{\beta, c_0} \Big\}. \qquad (17)$$

The number $1/\mu_{c_0}(\beta)$ is a uniform lower bound for $|G_n b|_2 / (\sqrt{n}\,|b_{J(\beta)}|_2)$ over $b \in C_{\beta, c_0}$. Hence,
it is a lower bound for "eigenvalues" restricted over vectors with a support close to the
support of $\beta$. Also, note that $c \mapsto \mu_c(\beta)$ is non-decreasing, since the cone $C_{\beta, c}$ grows with c.

Theorem 2. Let $x > 0$ be fixed and let $\hat h = h_{\hat\beta_n}$, where

$$\hat\beta_n \in \operatorname*{argmin}_{b \in B} \big\{ R_n(b) + 2\,\mathrm{pen}(b) \big\},$$

with pen(b) given by (12). Then we have, with a probability larger than $1 - 29e^{-x}$:

$$\|\hat h - h_0\|_n^2 \le \inf_{\beta \in B} \big( \|h_\beta - h_0\|_n^2 + \mu_3(\beta)^2\,|\hat w_{J(\beta)}|_2^2 \big),$$

where

$$|\hat w_{J(\beta)}|_2^2 = \sum_{j \in J(\beta)} \hat w_j^2.$$

Note that

$$|\hat w_{J(\beta)}|_2^2 \le 2 |\beta|_0 \max_{j \in J(\beta)} \Big[ c_1^2\,\frac{x + \log M + \hat\ell_{n,x}(h_j)}{n}\,\hat V(h_j) + c_2^2 \Big( \frac{x + 1 + \log M + \hat\ell_{n,x}(h_j)}{n}\,\|h_j\|_{n,\infty} \Big)^2 \Big],$$

so the dominant term is (up to the log log term) of order $|\beta|_0 \log M / n$. This is the fast
rate to be found in sparse oracle inequalities [5], [15], [16]. Moreover, note that the (sparse)
oracle inequality in Theorem 2 is sharp, in the sense that there is a constant 1 in front of
the oracle term $\inf_{\beta \in B} \|h_\beta - h_0\|_n^2$, see Remark 2 below.

Now, let us state Theorem 2 under the restricted eigenvalue condition.

Assumption 1 (RE(s, $c_0$) [5]). For some integer $s \in \{1, \dots, M\}$ and a constant $c_0 > 0$,
we assume that $G_n$ satisfies:

$$0 < \kappa(s, c_0) = \min_{\substack{J \subset \{1,\dots,M\},\\ |J| \le s}}\ \min_{\substack{b \in \mathbb{R}^M \setminus \{0\},\\ |b_{J^c}|_{1,\hat w} \le c_0 |b_J|_{1,\hat w}}} \frac{|G_n b|_2}{\sqrt{n}\,|b_J|_2}.$$

Note that using the previous notations, we have

$$\kappa(s, c_0) = \min_{\substack{b \in \mathbb{R}^M \setminus \{0\},\\ |b|_0 \le s}} \frac{1}{\mu_{c_0}(b)}.$$

Corollary 1. Let $x > 0$, $s \in \{1, \dots, M\}$ be fixed and let $\hat h$ be the same as in Theorem 2.
Then, under Assumption RE(s, 3), we have, with a probability larger than $1 - 29e^{-x}$:

$$\|\hat h - h_0\|_n^2 \le \inf_{\substack{\beta \in B,\\ |\beta|_0 \le s}} \Big( \|h_\beta - h_0\|_n^2 + \frac{|\hat w_{J(\beta)}|_2^2}{\kappa(s, 3)^2} \Big).$$

Remark 1. Note that the constant $c_0 = 3$ (for $\mu_{c_0}(\beta)$) is used in Theorem 2. This is
because with a large probability, $\hat\beta - \beta$ belongs to the cone $C_{\beta, 3}$. Such an argument of
cone constraint is at the core of the convex analysis underlying the proof of fast oracle
inequalities for the Lasso, see for instance [HI], [5], [17].

Remark 2. We were able to prove a sharp sparse oracle inequality (with leading constant
1), because we adapted to our context a recent argument from [17], that uses some tools
from convex analysis (such as the fact that the subdifferential mapping is monotone,
see [29]) in the study of $\hat\beta$ as the minimum of the convex functional $R_n$ + pen.



5 An empirical Bernstein's inequality 

The proofs of Theorems 1 and 2 require a sharp control of the "noise term" arising from
model (3). For a fixed function h, this noise term is the stochastic process

$$Z_t(h) = \frac{1}{n}\sum_{i=1}^n \int_0^t \big(h(X_i) - \bar h_Y(s)\big)\,dM^i_s,$$

where we recall that $M^i_t = N^i_t - \Lambda^i_t$ are i.i.d. martingales with jumps of size
+1, as we assume the existence of the intensity function $\alpha_0$, see (2). In order to give an
upper bound on $|Z_t|$ that holds with a large probability, one can use Bernstein's inequality
for martingales with jumps, see [20], and note that a proof of this fact is implicit in the
proof of Theorem 3, see Section 6 below. Applied to the process $Z_t(h)$, this writes

$$\mathbb{P}\Big[ |Z_t(h)| \ge \sqrt{\frac{2vx}{n}} + \frac{x}{3n},\ V_t(h) \le v \Big] \le 2e^{-x}$$

for any x, v > 0, where

$$V_t(h) = n \langle Z(h) \rangle_t = \frac{1}{n}\sum_{i=1}^n \int_0^t \big(h(X_i) - \bar h_Y(s)\big)^2 \alpha_0(s, X_i) Y^i_s\,ds$$

is the predictable variation of $Z_t$, which will also be referred to as variance term. But,
since the term $V_t(h)$ depends explicitly on the unknown intensity $\alpha_0$, one cannot use it
in the penalizing term of the Lasso estimator. Moreover, this result is stated on the event
$\{V_t \le v\}$ while we would like an inequality that holds in general. Hence, we need a new
Bernstein's type inequality, that uses an observable empirical variance term instead of
$V_t(h)$. We prove in Theorem 3 below that we can replace $V_t(h)$ by the optional variation
of $Z_t(h)$, which can also be seen as an estimator of $V_t(h)$ and is defined as:

$$\hat V_t(h) = n [Z(h)]_t = \frac{1}{n}\sum_{i=1}^n \int_0^t \big(h(X_i) - \bar h_Y(s)\big)^2\,dN^i_s.$$

Moreover, our result holds in general, and not on $\{V_t(h) \le v\}$. The counterpart for this is
the presence of a small log log term in the upper bound for $|Z_t(h)|$.

Theorem 3. For any numerical constants $c_\ell > 1$, $\epsilon > 0$ and $c_0 > 0$ such that
$e c_0 > 2(4/3 + \epsilon) c_\ell$, the following holds for any $x > 0$:

$$\mathbb{P}\Big[ |Z_t(h)| \ge c_1 \sqrt{\frac{x + \hat\ell_{n,x}(h)}{n}\,\hat V_t(h)} + c_2\,\frac{x + 1 + \hat\ell_{n,x}(h)}{n}\,\|h\|_{n,\infty} \Big] \le c_3 e^{-x}, \qquad (18)$$

where

$$\hat\ell_{n,x}(h) = c_\ell \log\log\Big( \frac{2 e n \hat V_t(h) + 8 e (4/3 + \epsilon) x \|h\|_{n,\infty}^2}{4 (e c_0 - 2(4/3 + \epsilon) c_\ell) \|h\|_{n,\infty}^2} \vee e \Big), \qquad \|h\|_{n,\infty} = \max_{i=1,\dots,n} |h(X_i)|,$$

and where

$$c_1 = 2\sqrt{1 + \epsilon}, \qquad c_2 = 2\sqrt{2 \max\big(c_0,\ 2(1 + \epsilon)(4/3 + \epsilon)\big)} + 2/3,$$
$$c_3 = 8 + 6 (\log(1 + \epsilon))^{-c_\ell} \sum_{j \ge 1} j^{-c_\ell}.$$

By choosing $c_\ell = 2$, $\epsilon = 1$ and $c_0 = 4(4/3 + \epsilon) c_\ell / e = 56/(3e)$, Inequality (18) holds with
the following numerical values:

$$c_1 = 2\sqrt{2}, \qquad c_2 = 4\sqrt{14/3} + 2/3 \le 9.31,$$
$$c_3 = 8 + 6 (\log 2)^{-2} \sum_{j \ge 1} j^{-2} \le 28.55,$$
$$\hat\ell_{n,x}(h) = 2 \log\log\Big( \frac{6 e n \hat V_t(h) + 56 e x \|h\|_{n,\infty}^2}{112\,\|h\|_{n,\infty}^2} \vee e \Big).$$

The concentration inequality (18) is fully data-driven, since the random variable that
upper bounds $|Z_t(h)|$ with a large probability is observable. Note that the numerical values
given in Theorem 3 are the ones used in the construction of the $\ell_1$-penalization (12). These
are chosen for the sake of simplicity, but another combination of numerical values can be
considered as well.
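Since other combinations of numerical values are allowed, it can be handy to recompute the constants of Theorem 3 for a given choice. The following is a hedged sketch: it assumes $c_1 = 2\sqrt{1+\epsilon}$, $c_2 = 2\sqrt{2\max(c_0, 2(1+\epsilon)(4/3+\epsilon))} + 2/3$ and $c_0 = 4(4/3+\epsilon)c_\ell/e$ with $e$ the base of the natural logarithm. It further assumes, as a reading consistent with the stated bound 28.55, that $c_3 = 8 + 6(\log(1+\epsilon))^{-c_\ell}\sum_{j\ge1} j^{-c_\ell}$. The function name `bernstein_constants` and the truncation level `n_terms` of the series are hypothetical.

```python
import math
from typing import Tuple


def bernstein_constants(eps: float = 1.0, c_ell: float = 2.0,
                        n_terms: int = 200000) -> Tuple[float, float, float, float]:
    """Return (c0, c1, c2, c3) for given eps > 0 and c_ell > 1 (assumed formulas)."""
    c0 = 4.0 * (4.0 / 3.0 + eps) * c_ell / math.e
    c1 = 2.0 * math.sqrt(1.0 + eps)
    c2 = 2.0 * math.sqrt(2.0 * max(c0, 2.0 * (1.0 + eps) * (4.0 / 3.0 + eps))) + 2.0 / 3.0
    # Truncated series sum_{j>=1} j^{-c_ell}; it converges because c_ell > 1.
    series = sum(j ** (-c_ell) for j in range(1, n_terms + 1))
    c3 = 8.0 + 6.0 * math.log(1.0 + eps) ** (-c_ell) * series
    return c0, c1, c2, c3
```

With the default values `eps = 1` and `c_ell = 2`, the computed `c2` and `c3` fall just below 9.31 and 28.55 respectively, in line with the bounds quoted in the theorem and with the probability $1 - 29e^{-x}$ used in Section 4.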

The idea of using Bernstein's deviation inequality with an estimated variance is of
importance for statistical problems. In [3] for instance, a Bernstein's inequality with
empirical variance is derived in order to study the Dantzig selector for density estimation.
Note that, however, we are not aware of a previous result such as Theorem 3 for continuous
time martingales with jumps, except for a work in progress [13], which uses similar
concentration inequalities in the context of point processes.

6 Proofs

6.1 Decomposition of the least-squares 

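In this section, we give the details of the construction of the partial least-squares
criterion (6). It is based on the decomposition, using the additive structure (5), of the
least-squares risk for counting processes depending on covariates, see for instance [25] and [§].
In model (3), on the basis of the observations (4), the least-squares functional to be
considered for the estimation of $\alpha_0$ is given by

$$L_n(\alpha) = \frac{1}{n}\sum_{i=1}^n \int_0^1 \alpha^2(t, X_i) Y^i_t\,dt - \frac{2}{n}\sum_{i=1}^n \int_0^1 \alpha(t, X_i)\,dN^i_t,$$

where $\alpha : \mathbb{R}_+ \times \mathbb{R}^d \to \mathbb{R}_+$. Now, if $\alpha(t, x) = \lambda(t) + h(x)$, we can decompose $L_n$ in the
following way:

$$L_n(\alpha) = L_{n,1}(\lambda) + L_{n,2}(h) + L_{n,3}(\lambda, h), \qquad (19)$$

where

$$L_{n,1}(\lambda) = \frac{1}{n}\sum_{i=1}^n \int_0^1 \big(\lambda(t) + \bar h_Y(t)\big)^2 Y^i_t\,dt - \frac{2}{n}\sum_{i=1}^n \int_0^1 \big(\lambda(t) + \bar h_Y(t)\big)\,dN^i_t,$$
$$L_{n,2}(h) = \frac{1}{n}\sum_{i=1}^n \int_0^1 \big(h(X_i) - \bar h_Y(t)\big)^2 Y^i_t\,dt - \frac{2}{n}\sum_{i=1}^n \int_0^1 \big(h(X_i) - \bar h_Y(t)\big)\,dN^i_t,$$
$$L_{n,3}(\lambda, h) = \frac{2}{n}\sum_{i=1}^n \int_0^1 \big(\lambda(t) + \bar h_Y(t)\big)\big(h(X_i) - \bar h_Y(t)\big) Y^i_t\,dt,$$

where, as introduced in Section 3,

$$\bar h_Y(t) = \frac{\sum_{i=1}^n h(X_i) Y^i_t}{\sum_{i=1}^n Y^i_t}.$$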

Now, the point is that, according to Lemma 1 below, the term $L_{n,3}$ is zero.

Lemma 1. For any function $h : \mathbb{R}^d \to \mathbb{R}_+$ and any function $\varphi : \mathbb{R}_+ \to \mathbb{R}_+$, we have

$$\sum_{i=1}^n \int_0^1 \varphi(t) \big(h(X_i) - \bar h_Y(t)\big) Y^i_t\,dt = 0.$$
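The identity in Lemma 1 holds pointwise in t: at each time, the risk-set average $\bar h_Y(t)$ is exactly the value that centers $h(X_i)$ over the individuals still at risk, so the integrand sums to zero before any integration. The following is a small numerical illustration for censored data with $Y^i_t = 1(Z_i \ge t)$; the function name `centered_risk_sum` is hypothetical.

```python
from typing import Sequence


def centered_risk_sum(values: Sequence[float], Z: Sequence[float], t: float) -> float:
    """Return sum_i (h(X_i) - hbar_Y(t)) * Y^i_t, with values[i] = h(X_i)."""
    risk = [v for v, z in zip(values, Z) if z >= t]  # individuals with Y^i_t = 1
    if not risk:
        return 0.0  # empty risk set: every term vanishes
    hbar = sum(risk) / len(risk)
    return sum(v - hbar for v in risk)
```

Multiplying by $\varphi(t)$ and integrating over [0, 1] therefore gives zero, up to floating-point rounding in the numerical check.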



Lemma 1 follows from an easy computation which is omitted. The term $L_{n,2}$ in (19)
is the partial least-squares criterion considered in Section 3, see Equation (6). We now
explain why it is suitable for the estimation of $h_0$. If the Aalen additive model holds, we
have $dN^i_t = (\lambda_0(t) + h_0(X_i)) Y^i_t\,dt + dM^i_t$ for all $i = 1, \dots, n$, so we can write, using again
Lemma 1,

$$L_{n,2}(h) = \frac{1}{n}\sum_{i=1}^n \int_0^1 \big(h(X_i) - \bar h_Y(t)\big)^2 Y^i_t\,dt$$
$$\qquad - \frac{2}{n}\sum_{i=1}^n \int_0^1 \big(h(X_i) - \bar h_Y(t)\big)\big(h_0(X_i) - \bar h_{0,Y}(t)\big) Y^i_t\,dt$$
$$\qquad - \frac{2}{n}\sum_{i=1}^n \int_0^1 \big(h(X_i) - \bar h_Y(t)\big)\,dM^i_t,$$

where

$$\bar h_{0,Y}(t) = \frac{\sum_{i=1}^n h_0(X_i) Y^i_t}{\sum_{i=1}^n Y^i_t}.$$

Now, using the empirical norm $\|\cdot\|_n$ defined in Equation (15), see Section 4 above, we can
finally write

$$L_{n,2}(h) = \|h - h_0\|_n^2 - \|h_0\|_n^2 - \frac{2}{n}\sum_{i=1}^n \int_0^1 \big(h(X_i) - \bar h_Y(t)\big)\,dM^i_t. \qquad (20)$$

The last term in the right hand side of (20) is a noise term, with tails controlled in Section 5
above. It is now understood that finding a minimizer of $L_{n,2}$, or a penalized version of it,
is a natural way of estimating $h_0$. We refer the reader to [25] for another justification of
the "partial least-squares" criterion in the linear case $h_0(x) = x^\top \beta_0$.



6.2 Proof of Theorem 3

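For $i = 1, \dots, n$, the processes $N^i$ are i.i.d. counting processes satisfying the
Doob-Meyer decomposition $N^i_t - \int_0^t \alpha_0(s, X_i) Y^i_s\,ds = M^i_t$, see Equation (3). This implies that
the processes $M^i$ are i.i.d. centered martingales, with predictable variation
$\langle M^i \rangle_t = \int_0^t \alpha_0(s, X_i) Y^i_s\,ds$ and optional variation $[M^i]_t = N^i_t$, see e.g. [2] for details. Moreover, the
jumps of each $M^i$, denoted by $\Delta M^i_t = M^i_t - M^i_{t-}$, are in {0, 1}. Introduce the process

$$U_t = \frac{1}{n}\sum_{i=1}^n \int_0^t H^i_s\,dM^i_s,$$

where

$$H^i_t = \frac{h(X_i) - \bar h_Y(t)}{2 \max_{l=1,\dots,n} |h(X_l)|}.$$

Note that $|H^i_t| \le 1$. Since $H^i$ is predictable and bounded, the process U is a square
integrable martingale, as a sum of square integrable martingales. Its predictable variation
$\langle U \rangle$ is given by:

$$\vartheta_t = n \langle U \rangle_t = \frac{1}{n}\sum_{i=1}^n \int_0^t (H^i_s)^2 \alpha_0(s, X_i) Y^i_s\,ds,$$

while its optional variation [U] is given by

$$\hat\vartheta_t = n [U]_t = \frac{1}{n}\sum_{i=1}^n \int_0^t (H^i_s)^2\,dN^i_s.$$

From [32], we know that

$$\exp(\lambda U_t - S_\lambda(t)) \qquad (21)$$

is a supermartingale if $S_\lambda$ is the compensator of

$$E_t = \sum_{0 \le s \le t} \{ \exp(\lambda \Delta U_s) - 1 - \lambda \Delta U_s \}.$$

We now derive the expression of $S_\lambda$. The process E can also be written as

$$E_t = \sum_{s \le t} \sum_{k \ge 2} \frac{\lambda^k}{k!} (\Delta U_s)^k = \sum_{s \le t} \sum_{k \ge 2} \frac{\lambda^k}{k!\,n^k} \Big( \sum_{i=1}^n H^i_s \Delta M^i_s \Big)^k = \sum_{s \le t} \sum_{k \ge 2} \frac{\lambda^k}{k!\,n^k} \sum_{i=1}^n \big( H^i_s \Delta M^i_s \big)^k,$$

where the last equality holds almost surely, since the $M^i$ are independent, hence do not
jump at the same time (with probability 1). Now, note that

$$\big( H^i_s \Delta M^i(s) \big)^k = (H^i_s)^k \Delta M^i(s) = (H^i_s)^k \Delta N^i(s),$$

so that we have

$$S_\lambda(t) = \sum_{i=1}^n \int_0^t \phi\Big( \frac{\lambda}{n} H^i_s \Big) \alpha_0(s, X_i) Y^i_s\,ds,$$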

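with $\phi(x) = e^x - x - 1$. The fact that (21) is a supermartingale entails

$$\mathbb{P}\Big[ U_t \ge \frac{S_\lambda(t)}{\lambda} + \frac{x}{\lambda} \Big] \le e^{-x} \qquad (22)$$

for any $\lambda, x > 0$. The following facts hold true:

• $\phi(xh) \le h^2 \phi(x)$ for any $0 \le h \le 1$ and $x > 0$;

• $\phi(\lambda) \le \frac{\lambda^2}{2(1 - \lambda/3)}$ for any $\lambda \in (0, 3)$;

• $\min_{\lambda \in (0, 1/b)} \big( \frac{a\lambda}{1 - b\lambda} + \frac{x}{\lambda} \big) = 2\sqrt{ax} + bx$, for any $a, b, x > 0$.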
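The three elementary facts used here are easy to sanity-check numerically. The sketch below evaluates the first two inequalities on a grid and, for the third, uses the stationary point $\lambda^* = \sqrt{x} / (\sqrt{a} + b\sqrt{x})$, obtained by setting the derivative to zero (this closed form is a computation added for illustration and is not stated in the text). All function names are hypothetical.

```python
import math
from typing import Tuple


def phi(x: float) -> float:
    """phi(x) = exp(x) - x - 1, computed with expm1 for accuracy near zero."""
    return math.expm1(x) - x


def fact1_holds(x: float, h: float, tol: float = 1e-12) -> bool:
    """Check phi(x*h) <= h^2 * phi(x) for 0 <= h <= 1 and x > 0."""
    return phi(x * h) <= h * h * phi(x) + tol


def fact2_holds(lam: float, tol: float = 1e-12) -> bool:
    """Check phi(lam) <= lam^2 / (2 * (1 - lam/3)) for lam in (0, 3)."""
    return phi(lam) <= lam * lam / (2.0 * (1.0 - lam / 3.0)) + tol


def bernstein_min(a: float, b: float, x: float) -> Tuple[float, float]:
    """Return (lam_star, value) minimizing a*lam/(1 - b*lam) + x/lam on (0, 1/b)."""
    lam_star = math.sqrt(x) / (math.sqrt(a) + b * math.sqrt(x))
    value = a * lam_star / (1.0 - b * lam_star) + x / lam_star
    return lam_star, value
```

For any positive `a`, `b`, `x`, the returned `value` agrees with $2\sqrt{ax} + bx$ up to rounding, which is the sub-Gaussian plus sub-exponential shape of the Bernstein bound derived next.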
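For any $w > 0$, they entail the following embeddings:

$$\Big\{ U_t \ge \sqrt{\frac{2wx}{n}} + \frac{x}{3n},\ \vartheta_t \le w \Big\} = \Big\{ U_t \ge \frac{\lambda_w}{2(n - \lambda_w/3)}\,w + \frac{x}{\lambda_w},\ \vartheta_t \le w \Big\}$$
$$\qquad \subset \Big\{ U_t \ge \frac{\phi(\lambda_w / n)}{\lambda_w}\,n \vartheta_t + \frac{x}{\lambda_w},\ \vartheta_t \le w \Big\} \subset \Big\{ U_t \ge \frac{S_{\lambda_w}(t)}{\lambda_w} + \frac{x}{\lambda_w} \Big\},$$

where $\lambda_w$ achieves the infimum. This leads to the standard Bernstein's inequality:

$$\mathbb{P}\Big[ U_t \ge \sqrt{\frac{2wx}{n}} + \frac{x}{3n},\ \vartheta_t \le w \Big] \le e^{-x}. \qquad (23)$$

By choosing $w = c_0 (x + 1)/n$ for some constant $c_0 > 0$, this gives the following inequality,
which says that when the variance term $\vartheta_t$ is small, the sub-exponential term is dominating
in Bernstein's inequality:

$$\mathbb{P}\Big[ U_t \ge \Big( \sqrt{2 c_0} + \frac{1}{3} \Big) \frac{x + 1}{n},\ \vartheta_t \le \frac{c_0 (x + 1)}{n} \Big] \le e^{-x}. \qquad (24)$$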



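For any $0 < v < w < +\infty$, we have

$$\Big\{ U_t \ge \sqrt{\frac{2 w \vartheta_t x}{v n}} + \frac{x}{3n} \Big\} \cap \{ v < \vartheta_t \le w \} \subset \Big\{ U_t \ge \sqrt{\frac{2wx}{n}} + \frac{x}{3n} \Big\} \cap \{ v < \vartheta_t \le w \},$$

so, together with (22) and (23), we obtain

$$\mathbb{P}\Big[ U_t \ge \sqrt{\frac{2 w \vartheta_t x}{v n}} + \frac{x}{3n},\ v < \vartheta_t \le w \Big] \le e^{-x}. \qquad (25)$$

Now, we want to replace $\vartheta_t$ by the observable $\hat\vartheta_t$ in the deviation (25). Note that the
process $\hat U_t$ given by

$$\hat U_t = \hat\vartheta_t - \vartheta_t = \frac{1}{n}\sum_{i=1}^n \int_0^t (H^i_s)^2 \big( dN^i_s - \alpha_0(s, X_i) Y^i_s\,ds \big)$$

is a martingale, so following the same steps as for $U_t$, we obtain that $\exp(\lambda \hat U_t - \hat S_\lambda(t))$ is
a supermartingale, with

$$\hat S_\lambda(t) = \sum_{i=1}^n \int_0^t \phi\Big( \frac{\lambda}{n} (H^i_s)^2 \Big) \alpha_0(s, X_i) Y^i_s\,ds.$$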
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Now, writing again (|23p for Ut with the fact that \H\\ < 1 and using the same arguments 
as before, we arrive at 



and 



But, if tit satisfies 



then it satisfies 



and 



. ~ „ . 6(X/n) „ x 

\ti t -ti t \ > ^~y^ u ^ + x 



< 2e~ 



, 2wtitx x 

ti t ~ M > \ — + —,v<ti t <w 

vn in 



< 2e~ x . 



2wtitx x 



1 



v 3/ n 



w ( 1 w 



ti t <2ti t + 2 (~ + \ - - + - +— - 



v V3 v 



2u;\ x 



v / n 



(26) 



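simply by using the fact that $A \le b + \sqrt{aA}$ entails $A \le a + 2b$ for any $a, A, b > 0$. This proves that

\[
\Big\{ U_t \le \sqrt{\frac{2w\vartheta_t x}{vn}} + \frac{x}{3n} \Big\} \cap \Big\{ |\hat\vartheta_t - \vartheta_t| \le \sqrt{\frac{2w\vartheta_t x}{vn}} + \frac{x}{3n} \Big\}
\subset \Big\{ U_t \le 2\sqrt{\frac{wx}{vn} \hat\vartheta_t} + \Big( 2\sqrt{\frac{w}{v}\Big( \frac{w}{v} + \frac{1}{3} \Big)} + \frac{1}{3} \Big) \frac{x}{n} \Big\}, \tag{27}
\]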
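The elementary fact "$A \le b + \sqrt{aA}$ entails $A \le a + 2b$" used above can be illustrated as follows. This is a sketch under stated assumptions: the helper names are hypothetical, and the argument coded here is the standard one (solve the quadratic inequality in $s = \sqrt{A}$, then compare the largest admissible $A$ with $a + 2b$), which is not necessarily the route taken by the authors.

```python
import math
import random

def largest_admissible(a, b):
    """Largest A >= 0 with A <= b + sqrt(a*A).

    Writing s = sqrt(A), the constraint is s^2 - sqrt(a)*s - b <= 0,
    so s is at most the positive root (sqrt(a) + sqrt(a + 4b)) / 2.
    """
    s = (math.sqrt(a) + math.sqrt(a + 4.0 * b)) / 2.0
    return s * s

def implication_holds(trials=10000, seed=0):
    """Check on random (a, b) that every admissible A satisfies A <= a + 2b."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.uniform(1e-3, 10.0)
        b = rng.uniform(1e-3, 10.0)
        if largest_admissible(a, b) > a + 2.0 * b + 1e-12:
            return False
    return True
```

In the text the fact is applied with $A = \vartheta_t$, $a = 2wx/(vn)$ and $b = \hat\vartheta_t + x/(3n)$, which gives $\vartheta_t \le 2\hat\vartheta_t + 2(w/v + 1/3)x/n$.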



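so using (25) and (26), we obtain

\[
\mathbb{P}\Big[ U_t > 2\sqrt{\frac{wx}{vn} \hat\vartheta_t} + \Big( 2\sqrt{\frac{w}{v}\Big( \frac{w}{v} + \frac{1}{3} \Big)} + \frac{1}{3} \Big) \frac{x}{n},\ v < \vartheta_t \le w \Big] \le 3e^{-x}.
\]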



This inequality is similar to (25), where we replaced $\vartheta_t$ by the observable $\hat\vartheta_t$ in the sub-Gaussian term. It remains to remove the event $\{v < \vartheta_t \le w\}$ from this inequality. First, recall that (24) holds, so we can work on the event $\{\vartheta_t > c_0(x + 1)/n\}$ from now on. We use a peeling argument: define, for $j \ge 0$:

\[
v_j = c_0 \frac{x+1}{n} (1 + \epsilon)^j,
\]

and use the following decomposition into disjoint sets:

\[
\{\vartheta_t > v_0\} = \bigcup_{j \ge 0} \{ v_j < \vartheta_t \le v_{j+1} \}.
\]



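We have

\[
\mathbb{P}\Big[ U_t > c_{1,\epsilon} \sqrt{\frac{x}{n} \hat\vartheta_t} + c_{2,\epsilon} \frac{x}{n},\ v_j < \vartheta_t \le v_{j+1} \Big] \le 3e^{-x},
\]

where we introduced the constants

\[
c_{1,\epsilon} = 2\sqrt{1 + \epsilon} \quad \text{and} \quad c_{2,\epsilon} = 2\sqrt{(1 + \epsilon)(4/3 + \epsilon)} + 1/3.
\]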



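Let us introduce

\[
\ell = c_\ell \log\log\Big( \frac{\vartheta_t}{v_0} \vee e \Big),
\]

where $c_\ell > 1$. On the event

\[
\Big\{ |\hat\vartheta_t - \vartheta_t| \le \sqrt{\frac{2(1 + \epsilon)\vartheta_t (x + \ell)}{n}} + \frac{x + \ell}{3n} \Big\}
\]

we have

\[
\vartheta_t \le 2\hat\vartheta_t + 2(4/3 + \epsilon)\frac{x}{n} + 2(4/3 + \epsilon)\frac{c_\ell}{n} \log\log\Big( \frac{\vartheta_t}{v_0} \vee e \Big),
\]

which entails, assuming that $e c_0 > 2(4/3 + \epsilon) c_\ell$ (here $e = \exp(1)$):

\[
\vartheta_t \le \frac{e c_0}{e c_0 - 2(4/3 + \epsilon) c_\ell} \Big( 2\hat\vartheta_t + 2(4/3 + \epsilon)\frac{x}{n} \Big),
\]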

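where we used the fact that $\log\log(x) \le x/e - 1$ for any $x \ge e$. This entails, together with (27), the following embedding:

\[
\Big\{ U_t \le \sqrt{\frac{2(1 + \epsilon)\vartheta_t (x + \ell)}{n}} + \frac{x + \ell}{3n} \Big\} \cap \Big\{ |\hat\vartheta_t - \vartheta_t| \le \sqrt{\frac{2(1 + \epsilon)\vartheta_t (x + \ell)}{n}} + \frac{x + \ell}{3n} \Big\}
\subset \Big\{ U_t \le c_{1,\epsilon} \sqrt{\frac{x + \hat\ell}{n} \hat\vartheta_t} + c_{2,\epsilon} \frac{x + \hat\ell}{n} \Big\},
\]

where

\[
\hat\ell = c_\ell \log\log\Big( \frac{2e\big( n\hat\vartheta_t + (4/3 + \epsilon) x \big)}{e c_0 - 2(4/3 + \epsilon) c_\ell} \vee e \Big).
\]

Now, using the previous embeddings together with (25) and (26), we obtain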



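\[
\begin{aligned}
\mathbb{P}\Big[ U_t > c_{1,\epsilon} \sqrt{\frac{x + \hat\ell}{n} \hat\vartheta_t} + c_{2,\epsilon} \frac{x + \hat\ell}{n},\ \vartheta_t > v_0 \Big]
&\le \sum_{j \ge 0} \mathbb{P}\Big[ U_t > \sqrt{\frac{2(1 + \epsilon)\vartheta_t (x + \ell)}{n}} + \frac{x + \ell}{3n},\ v_j < \vartheta_t \le v_{j+1} \Big] \\
&\quad + \sum_{j \ge 0} \mathbb{P}\Big[ |\hat\vartheta_t - \vartheta_t| > \sqrt{\frac{2(1 + \epsilon)\vartheta_t (x + \ell)}{n}} + \frac{x + \ell}{3n},\ v_j < \vartheta_t \le v_{j+1} \Big] \\
&\le 3\Big( e^{-x} + \sum_{j \ge 1} e^{-(x + c_\ell \log\log(v_j/v_0))} \Big) \\
&= 3\Big( 1 + (\log(1 + \epsilon))^{-c_\ell} \sum_{j \ge 1} j^{-c_\ell} \Big) e^{-x}.
\end{aligned}
\]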
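The last equality of the peeling bound rests on $v_j/v_0 = (1+\epsilon)^j$, so that $\log\log(v_j/v_0) = \log(j \log(1+\epsilon))$ and the series converges because $c_\ell > 1$. The sketch below compares both sides on truncated sums; the function names and the chosen values of $\epsilon$ and $c_\ell$ are illustrative assumptions only.

```python
import math

def peeling_sum_direct(eps, c_ell, terms):
    """sum_{j=1}^{terms} exp(-c_ell * log log(v_j / v_0)), with v_j / v_0 = (1+eps)^j."""
    total = 0.0
    for j in range(1, terms + 1):
        log_ratio = j * math.log1p(eps)  # log(v_j / v_0)
        total += math.exp(-c_ell * math.log(log_ratio))
    return total

def peeling_sum_closed(eps, c_ell, terms):
    """(log(1+eps))^(-c_ell) * sum_{j=1}^{terms} j^(-c_ell)."""
    return math.log1p(eps) ** (-c_ell) * sum(j ** (-c_ell) for j in range(1, terms + 1))

def peeling_prefactor(eps, c_ell, terms=100000):
    """Truncated value of 3 * (1 + (log(1+eps))^(-c_ell) * sum_j j^(-c_ell))."""
    return 3.0 * (1.0 + peeling_sum_closed(eps, c_ell, terms))
```

For instance, with $c_\ell = 2$ and $\epsilon = e - 1$ the prefactor is $3(1 + \pi^2/6)$, up to the truncation error of the series.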
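Together with (24), this gives

\[
\mathbb{P}\Big[ U_t > c_{1,\epsilon} \sqrt{\frac{x + \hat\ell}{n} \hat\vartheta_t} + c_{3,\epsilon} \frac{x + 1 + \hat\ell}{n} \Big] \le \Big( 4 + 3(\log(1 + \epsilon))^{-c_\ell} \sum_{j \ge 1} j^{-c_\ell} \Big) e^{-x},
\]

where $c_{3,\epsilon} = \sqrt{2\max(c_0, 2(1 + \epsilon)(4/3 + \epsilon))} + 1/3$. Now, it suffices to multiply both sides of the inequality

\[
U_t > c_{1,\epsilon} \sqrt{\frac{x + \hat\ell}{n} \hat\vartheta_t} + c_{3,\epsilon} \frac{x + 1 + \hat\ell}{n}
\]

by $2\|h\|_{n,\infty}$ to recover the statement of Theorem 3. □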

6.3 Some notations and preliminary results for the proof of the oracle 
inequalities 

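Let us introduce the following notations. Let $h(\cdot) = (h_1(\cdot), \ldots, h_M(\cdot))^\top$ and $\bar h_Y(\cdot) = (\bar h_{1,Y}(\cdot), \ldots, \bar h_{M,Y}(\cdot))^\top$, so that $h_\beta = h^\top \beta$ and $\bar h_{\beta,Y} = \bar h_Y^\top \beta$. We will use the notation $\langle \cdot, \cdot \rangle_n$ for the following "empirical" inner product between two functions $h, h' : \mathbb{R}^d \to \mathbb{R}^+$ (two "covariates" functions):

\[
\langle h, h' \rangle_n = \frac{1}{n} \sum_{i=1}^n \int_0^1 (h(X_i) - \bar h_Y(t))(h'(X_i) - \bar h'_Y(t)) Y_t^i \, dt,
\]

and the corresponding empirical norm:

\[
\|h\|_n^2 = \frac{1}{n} \sum_{i=1}^n \int_0^1 (h(X_i) - \bar h_Y(t))^2 Y_t^i \, dt.
\]

Note that with these notations, we have:

\[
\beta^\top H_n \beta' = \langle h_\beta, h_{\beta'} \rangle_n.
\]

To avoid any possible confusion, we will always write $\beta^\top \beta'$ for the Euclidean inner product between two vectors $\beta$ and $\beta'$ in $\mathbb{R}^M$.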
In view of (11), we can write

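\[
H_n = \frac{1}{n} \sum_{i=1}^n \int_0^1 (h(X_i) - \bar h_Y(t))(h(X_i) - \bar h_Y(t))^\top Y_t^i \, dt
\]

and

\[
h_n = \frac{1}{n} \sum_{i=1}^n \int_0^1 (h(X_i) - \bar h_Y(t)) \, dN_t^i.
\]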
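The identity $\beta^\top H_n \beta' = \langle h_\beta, h_{\beta'} \rangle_n$ can be illustrated on simulated data. The sketch below makes several assumptions that are not taken from this section: $\bar h_Y(t)$ is computed as the average of $h(X_i)$ over individuals at risk at time $t$, the at-risk indicator is $Y_t^i = 1(Z_i \ge t)$, and the time integral over $[0, 1]$ is replaced by a midpoint Riemann sum. All function names are hypothetical.

```python
import random

def at_risk_mean(values, at_risk):
    """Average of values over individuals at risk (0.0 if nobody is at risk)."""
    total = sum(at_risk)
    return sum(v * y for v, y in zip(values, at_risk)) / total if total else 0.0

def empirical_inner(f_vals, g_vals, exit_times, grid=200):
    """Midpoint Riemann sum for <f, g>_n on [0, 1]; f_vals[i] stands for f(X_i)."""
    n = len(exit_times)
    acc = 0.0
    for k in range(grid):
        t = (k + 0.5) / grid
        y = [1.0 if z >= t else 0.0 for z in exit_times]
        f_bar, g_bar = at_risk_mean(f_vals, y), at_risk_mean(g_vals, y)
        acc += sum((f - f_bar) * (g - g_bar) * yi
                   for f, g, yi in zip(f_vals, g_vals, y)) / grid
    return acc / n

def gram_matrix(dict_vals, exit_times, grid=200):
    """H_n[j][k] = <h_j, h_k>_n, where dict_vals[j][i] stands for h_j(X_i)."""
    m = len(dict_vals)
    return [[empirical_inner(dict_vals[j], dict_vals[k], exit_times, grid)
             for k in range(m)] for j in range(m)]

def combine(dict_vals, beta):
    """Values of h_beta = sum_j beta_j h_j at the sample points."""
    n = len(dict_vals[0])
    return [sum(b * h[i] for b, h in zip(beta, dict_vals)) for i in range(n)]

def check_bilinearity(seed=1, n=8, m=3):
    """Return |beta^T H_n beta' - <h_beta, h_beta'>_n| on random inputs."""
    rng = random.Random(seed)
    dict_vals = [[rng.uniform(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    exit_times = [rng.uniform(0.1, 1.2) for _ in range(n)]
    beta = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    beta2 = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    H = gram_matrix(dict_vals, exit_times)
    lhs = sum(beta[j] * H[j][k] * beta2[k] for j in range(m) for k in range(m))
    rhs = empirical_inner(combine(dict_vals, beta), combine(dict_vals, beta2), exit_times)
    return abs(lhs - rhs)
```

The identity only uses the linearity of $h \mapsto \bar h_Y$, so it holds for the discretized quantities as well, up to floating-point error.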
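Now, in view of (5) and (3), the following holds:

\[
h_n = h'_n + Z_n, \tag{28}
\]

where:

\[
(h'_n)_j = \frac{1}{n} \sum_{i=1}^n \int_0^1 (h_j(X_i) - \bar h_{j,Y}(t)) \alpha_0(t, X_i) Y_t^i \, dt,
\qquad
(Z_n)_j = \frac{1}{n} \sum_{i=1}^n \int_0^1 (h_j(X_i) - \bar h_{j,Y}(t)) \, dM_t^i.
\]

Using Lemma 1 two times, we obtain:

\[
\begin{aligned}
(h'_n)_j &= \frac{1}{n} \sum_{i=1}^n \int_0^1 (h_j(X_i) - \bar h_{j,Y}(t))(\lambda_0(t) + h_0(X_i)) Y_t^i \, dt \\
&= \frac{1}{n} \sum_{i=1}^n \int_0^1 (h_j(X_i) - \bar h_{j,Y}(t)) h_0(X_i) Y_t^i \, dt \\
&= \frac{1}{n} \sum_{i=1}^n \int_0^1 (h_j(X_i) - \bar h_{j,Y}(t))(h_0(X_i) - \bar h_{0,Y}(t)) Y_t^i \, dt,
\end{aligned}
\]

namely

\[
(h'_n)_j = \langle h_j, h_0 \rangle_n. \tag{29}
\]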

6.4 Proof of Theorem 1

Recall that the empirical risk $R_n$ is given by (10). As a consequence of (28) and (29), we obtain the following decomposition of the empirical risk:

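\[
R_n(\beta) = \beta^\top H_n \beta - 2\beta^\top h_n = \|h_\beta\|_n^2 - 2\langle h_\beta, h_0 \rangle_n - 2\beta^\top Z_n,
\]

so, for any $\beta \in \mathbb{R}^M$ the following holds:

\[
R_n(\hat\beta) - R_n(\beta) = \|h_{\hat\beta} - h_0\|_n^2 - \|h_\beta - h_0\|_n^2 + 2(\beta - \hat\beta)^\top Z_n.
\]

By definition of $\hat\beta$, we have

\[
R_n(\hat\beta) + \mathrm{pen}(\hat\beta) \le R_n(\beta) + \mathrm{pen}(\beta)
\]

for any $\beta \in \mathbb{R}^M$, so:

\[
\|h_{\hat\beta} - h_0\|_n^2 \le \|h_\beta - h_0\|_n^2 + 2(\hat\beta - \beta)^\top Z_n + \mathrm{pen}(\beta) - \mathrm{pen}(\hat\beta).
\]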
Let us introduce the event

\[
A = \bigcap_{j=1}^M \{ 2|(Z_n)_j| \le \hat w_j \}, \tag{30}
\]

where the weights $\hat w_j$ are given by (14). Using Theorem 3 together with a union bound, we have that

\[
\mathbb{P}(A) \ge 1 - c_3 e^{-x},
\]

where $c_3$ is a purely numerical positive constant from Theorem 3. On $A$, we have

\[
|2(\hat\beta - \beta)^\top Z_n| \le \sum_{j=1}^M \hat w_j |\hat\beta_j - \beta_j| = |\hat\beta - \beta|_{1,\hat w},
\]

so recalling that $\mathrm{pen}(\beta) = \sum_{j=1}^M \hat w_j |\beta_j|$, we obtain

\[
\|h_{\hat\beta} - h_0\|_n^2 \le \|h_\beta - h_0\|_n^2 + 2\,\mathrm{pen}(\beta)
\]

for any $\beta \in \mathbb{R}^M$, which is the statement of Theorem 1. □

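6.5 Proof of Theorem 2

Recall the following notation: for any $J \subset \{1, \ldots, M\}$ and $x \in \mathbb{R}^M$, we define the vector $x_J \in \mathbb{R}^M$ with coordinates $(x_J)_j = x_j$ when $j \in J$ and $(x_J)_j = 0$ if $j \in J^c$, where $J^c = \{1, \ldots, M\} \setminus J$. Recall that

\[
\hat\beta \in \operatorname*{argmin}_{b \in B} \big\{ R_n(b) + 2\,\mathrm{pen}(b) \big\}, \tag{31}
\]

where $B$ is a convex set. This proof uses arguments from [17]. We denote by $\partial\phi$ the subdifferential mapping of a convex function $\phi$. The function $b \mapsto R_n(b)$ is differentiable, so the subdifferential of $R_n(\cdot) + 2\,\mathrm{pen}(\cdot)$ at a point $b \in \mathbb{R}^M$ is given by

\[
\partial(R_n + 2\,\mathrm{pen})(b) = \{\nabla R_n(b)\} + 2\,\partial\mathrm{pen}(b) = \{2H_n b - 2h_n\} + 2\,\partial\mathrm{pen}(b).
\]

So, Equation (31) means that there is $\hat\beta_\partial \in \partial\mathrm{pen}(\hat\beta)$ such that $-(\nabla R_n(\hat\beta) + 2\hat\beta_\partial)$ belongs to the normal cone of $B$ at $\hat\beta$:

\[
(2H_n \hat\beta - 2h_n + 2\hat\beta_\partial)^\top (\hat\beta - \beta) \le 0 \quad \forall \beta \in B. \tag{32}
\]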

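Inequality (32) can be written, using (28) and (29), in the following way:

\[
2\langle h_{\hat\beta} - h_\beta, h_{\hat\beta} - h_0 \rangle_n + 2(\hat\beta_\partial - \beta_\partial)^\top (\hat\beta - \beta) \le -2\beta_\partial^\top (\hat\beta - \beta) + 2Z_n^\top (\hat\beta - \beta),
\]

where we chose any $\beta_\partial \in \partial\mathrm{pen}(\beta)$. Now, we use the fact that the subdifferential mapping is monotone (this is an immediate consequence of its definition, see [29], Chapter 24, p. 240) to say that $(\hat\beta_\partial - \beta_\partial)^\top (\hat\beta - \beta) \ge 0$. Moreover, it is standard to see that

\[
\partial |b|_{1,\hat w} = \big\{ e + f : e_j = \hat w_j\,\mathrm{sgn}(b_j) \text{ and } f_{J(b)} = 0,\ |f_j| \le \hat w_j \text{ for any } j = 1, \ldots, M \big\},
\]

where $J(b) = \{j : b_j \ne 0\}$. Let $\beta \in B$ be fixed, and denote $J = J(\beta) = \{j : \beta_j \ne 0\}$. Consider $e$ and $f$ such that $\beta_\partial = e + f$, with $e_{J^c} = 0$. We have $|e^\top (\hat\beta - \beta)| \le |\hat\beta_J - \beta_J|_{1,\hat w}$ and we can take $f$ such that $f^\top (\hat\beta - \beta) = f^\top \hat\beta_{J^c} = |\hat\beta_{J^c}|_{1,\hat w}$. This gives

\[
2\langle h_{\hat\beta} - h_\beta, h_{\hat\beta} - h_0 \rangle_n + 2|\hat\beta_{J^c}|_{1,\hat w} \le 2|\hat\beta_J - \beta_J|_{1,\hat w} + 2Z_n^\top (\hat\beta - \beta).
\]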
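The description of the subdifferential of the weighted $\ell_1$-norm used above can be checked through the subgradient inequality $|b'|_{1,w} \ge |b|_{1,w} + g^\top (b' - b)$ for $g = e + f$. The following is a minimal sketch on random vectors; the function names, the dimension and the sampling ranges are illustrative assumptions.

```python
import random

def weighted_l1(b, w):
    """Weighted l1 norm: sum_j w_j * |b_j|."""
    return sum(wj * abs(bj) for wj, bj in zip(w, b))

def sign(v):
    """Sign of v, with sign(0) = 0."""
    return (v > 0) - (v < 0)

def subgradient(b, w, f_free):
    """An element e + f of the subdifferential of |.|_{1,w} at b.

    On the support of b: e_j = w_j * sign(b_j) and f_j = 0.
    Off the support: e_j = 0 and f_j = w_j * u_j with u_j clipped to [-1, 1].
    """
    return [wj * sign(bj) if bj != 0 else wj * max(-1.0, min(1.0, uj))
            for bj, wj, uj in zip(b, w, f_free)]

def subgradient_inequality_holds(trials=2000, seed=0, dim=5):
    """Check |b'|_{1,w} >= |b|_{1,w} + g^T (b' - b) on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        w = [rng.uniform(0.1, 2.0) for _ in range(dim)]
        b = [rng.choice([0.0, rng.uniform(-1.0, 1.0)]) for _ in range(dim)]
        g = subgradient(b, w, [rng.uniform(-1.0, 1.0) for _ in range(dim)])
        other = [rng.uniform(-2.0, 2.0) for _ in range(dim)]
        gap = (weighted_l1(other, w) - weighted_l1(b, w)
               - sum(gj * (oj - bj) for gj, oj, bj in zip(g, other, b)))
        if gap < -1e-12:
            return False
    return True
```

Choosing $f_j = \hat w_j\,\mathrm{sgn}(\hat\beta_j)$ off the support of $\beta$ is what yields $f^\top \hat\beta_{J^c} = |\hat\beta_{J^c}|_{1,\hat w}$ in the proof.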

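Using Pythagoras' theorem, we have

\[
2\langle h_{\hat\beta} - h_0, h_{\hat\beta} - h_\beta \rangle_n = \|h_{\hat\beta} - h_0\|_n^2 + \|h_{\hat\beta} - h_\beta\|_n^2 - \|h_\beta - h_0\|_n^2, \tag{33}
\]

so

\[
\|h_{\hat\beta} - h_0\|_n^2 + \|h_{\hat\beta} - h_\beta\|_n^2 + 2|\hat\beta_{J^c}|_{1,\hat w} \le \|h_\beta - h_0\|_n^2 + 2|\hat\beta_J - \beta_J|_{1,\hat w} + 2Z_n^\top (\hat\beta - \beta).
\]

If $\langle h_{\hat\beta} - h_0, h_{\hat\beta} - h_\beta \rangle_n < 0$, we have $\|h_{\hat\beta} - h_0\|_n \le \|h_\beta - h_0\|_n$, which entails the theorem, so we assume that $\langle h_{\hat\beta} - h_0, h_{\hat\beta} - h_\beta \rangle_n \ge 0$. In this case

\[
2|\hat\beta_{J^c}|_{1,\hat w} \le 2\langle h_{\hat\beta} - h_0, h_{\hat\beta} - h_\beta \rangle_n + 2|\hat\beta_{J^c}|_{1,\hat w} \le 2|\hat\beta_J - \beta_J|_{1,\hat w} + 2Z_n^\top (\hat\beta - \beta),
\]

which entails, together with the fact that, on $A$ (see (30)), we have

\[
2|Z_n^\top (\hat\beta - \beta)| \le 2|(Z_n)_J^\top (\hat\beta_J - \beta_J)| + 2|(Z_n)_{J^c}^\top \hat\beta_{J^c}| \le |\hat\beta_J - \beta_J|_{1,\hat w} + |\hat\beta_{J^c}|_{1,\hat w},
\]

that

\[
|\hat\beta_{J^c}|_{1,\hat w} \le 3|\hat\beta_J - \beta_J|_{1,\hat w}.
\]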
This means that $\hat\beta - \beta$ belongs to the cone defined in (16). So, using (17), we have

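\[
|\hat\beta_J - \beta_J|_2 \le \mu(\beta) |G_n(\hat\beta - \beta)|_2. \tag{34}
\]

Note that, on $A$, we have:

\[
\|h_{\hat\beta} - h_0\|_n^2 + \|h_{\hat\beta} - h_\beta\|_n^2 + |\hat\beta_{J^c}|_{1,\hat w} \le \|h_\beta - h_0\|_n^2 + 3|\hat\beta_J - \beta_J|_{1,\hat w}.
\]

A consequence of (34) is

\[
|\hat\beta_J - \beta_J|_{1,\hat w} \le |\hat w_J|_2 |\hat\beta_J - \beta_J|_2 \le \mu(\beta) |\hat w_J|_2 |G_n(\hat\beta - \beta)|_2,
\]

so we arrive at

\[
\|h_{\hat\beta} - h_0\|_n^2 \le \|h_\beta - h_0\|_n^2 + 3\mu(\beta) |\hat w_J|_2 \|h_{\hat\beta} - h_\beta\|_n - \|h_{\hat\beta} - h_\beta\|_n^2,
\]

and finally

\[
\|h_{\hat\beta} - h_0\|_n^2 \le \|h_\beta - h_0\|_n^2 + \frac{9}{4} \mu(\beta)^2 |\hat w_J|_2^2
\]

using the fact that $ax - x^2 \le a^2/4$ for any $a, x > 0$. □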